Why Chinese Web-as-Corpus is Wacky? Or: How Big Data is Killing Chinese Corpus Linguistics
Abstract
This paper examines and evaluates the current use of the Web-as-Corpus (WaC) paradigm in Chinese corpus linguistics. I argue that the unstable notion of wordhood in Chinese, and the resulting diversity of approaches to implementing word segmentation systems, pose great challenges for those keen on building web-scale corpus data. Two lexical measures are proposed to illustrate the issues, and methodological discussions are provided.
Similar resources
“Those Nation Wreckers are Suffering from Inferiority Complex”: The Depiction of Chinese Miners in the Ghanaian Press
This article studies the depiction of Chinese miners in the Ghanaian news website entitled Modern Ghana. A total of 87 articles comprising 43752 words were retrieved. Van Leeuwen’s (2008) theory of the representation of the social actors was utilised to examine the depiction of Chinese miners in the Ghanaian press. In this regard, six applicable tools were used and these include exclusion, role...
Corpora and English Teaching: Retrospect and Prospect
As a whole system of methods and principles for applying corpora in language study, corpus linguistics has revolutionized nearly all branches of linguistics. In the wake of this revolution, people began to rethink language pedagogy from a corpus perspective in the early 1990s. Today, however, although corpus linguistics has contributed much to English education, difficulties do exist, especially i...
The Jinan Chinese Learner Corpus
We present the Jinan Chinese Learner Corpus, a large collection of L2 Chinese texts produced by learners that can be used for educational tasks. The present work introduces the data and provides a detailed description. Currently, the corpus contains approximately 6 million Chinese characters written by students from over 50 different L1 backgrounds. This is a large-scale corpus of learner Chine...
Chinese Sketch Engine and Mapping Principles: A Corpus-Based Study of Conceptual Metaphors Using the BUILDING Source Domain
The goal of this paper is to use a large-scale corpus, i.e. the Gigaword Corpus via the interface of Chinese Sketch Engine, to determine the underlying reasons for source and target domain pairings in conceptual metaphors, called Mapping Principles. In particular, we will employ a frequency-based collocational approach to examine metaphors that use the source domain of BUILDING in Mandarin Chin...
On Bias-free Crawling and Representative Web Corpora
In this paper, I present a specialized open-source crawler that can be used to obtain bias-reduced samples from the web. First, I briefly discuss the relevance of bias-reduced web corpus sampling for corpus linguistics. Then, I summarize theoretical results that show how commonly used crawling methods obtain highly biased samples from the web. The theoretical part of the paper is followed by a d...
Publication date: 2014